NSF PAR Search | NSF Public Access Repository

Matching Roles from Temporal Data: Why Joe Biden is not only President, but also Commander-in-Chief

https://doi.org/10.1145/3588919

Bornemann, Leon; Bleifuß, Tobias; Kalashnikov, Dmitri V.; Nargesian, Fatemeh; Naumann, Felix; Srivastava, Divesh (May 2023, Proceedings of the ACM on Management of Data)

We present role matching, a novel, fine-grained integrity constraint on temporal fact data, i.e., (subject, predicate, object, timestamp)-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the real world evolves and the data changes over time. A role matching states that the associated object of two or more roles should always match across time. Once discovered, role matchings can serve as integrity constraints to improve data quality, for instance of structured data in Wikipedia[3]. If violated, role matchings can alert data owners or editors and thus allow them to correct the error. Finding all role matchings is challenging due both to the inherent quadratic complexity of the matching problem and the need to identify true matches based on the possibly short history of the facts observed so far. To address the first challenge, we introduce several blocking methods both for clean and dirty input data. For the second challenge, the matching stage, we show how the entity resolution method Ditto[27] can be adapted to achieve satisfactory performance for the role matching task. We evaluate our method on datasets from Wikipedia infoboxes, showing that our blocking approaches can achieve 95% recall, while maintaining a reduction ratio of more than 99.99%, even in the presence of dirty data. In the matching stage, we achieve a macro F1-score of 89% on our datasets, using automatically generated labels.

Full Text Available

Search for: All records